{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# MFE 230P: ASSIGNMENT III\n",
    "**YOUR STUDENT ID** \\\\ YOUR NAME \\\\ YOUR GROUP NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Feature Engineering & Spectral $k$-means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A. Similarity from Distance\n",
    "\n",
    "Show that if $d(\\cdot, \\cdot)$ is a [distance metric](http://bit.ly/2qMIu7d), then\n",
    "\n",
    "$$\\kappa_\\gamma(x, y) = \\exp\\left\\{-\\gamma \\cdot d(x, y)\\right\\}$$\n",
    "\n",
    "is a [positive-definite kernel](https://en.wikipedia.org/wiki/Positive-definite_kernel) for all $\\gamma > 0$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SOLUTION.\n",
    "\n",
    "_Your solution here._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### B. Feature Design\n",
    "\n",
    "Executing the cell below will create a pandas dataframe `data` containing 2015-16 daily adjusted percent returns for the top 99 US companies by market cap (as of 31 December 2016)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "data = pd.read_csv(\n",
    "    '../../data/top_99_returns.csv',\n",
    "    header=0,\n",
    "    index_col=0\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Engineer a few features (5-10) that, in your judgement, would be effective in characterizing the daily equity returns of each company. For example, (maximum/minimum) or average squared return—be creative and use your instincts. Then, project the asset returns `data` into to this new feature space by creating a new matrix `data_fe` where each row represents a company and each column represents a feature you engineered.\n",
    "\n",
    "Normalize each column of `data_fe` to have zero mean and unit standard deviation. Intuitively ration why we perform this standardization."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SOLUTION.\n",
    "\n",
    "_Your solution here._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### C. Kernel Matrix\n",
    "\n",
    "Create a square kernel matrix `kernel` and `kernel_fe` by applying the kernel transformation discussed in part A to both the raw returns in `data` as well as their projection into your feature space, represented in `data_fe`. You may set $\\gamma = 1$ and use the distance metric $d(x,y) = \\|x - y\\|_2^2$ or choose your own distance metric (just make sure it is a distance metric)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SOLUTION.\n",
    "\n",
    "_Your solution here._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### D. Spectral Clustering\n",
    "\n",
    "Perform 3-means clustering on the top 3 princicpal components of both `kernel` and `kernel_fe`. Print the tickers associated with each of the three clusters for both `kernel` and `kernel_fe`.\n",
    "\n",
    "Which method provides more intuitive groupings (this is a subjective question)?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SOLUTION.\n",
    "\n",
    "_Your solution here._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### E. Interpreting the Results\n",
    "\n",
    "In light of the previous analysis, what are the advantages and disadvantages of useing engineered features in place of the raw data? How would the analysis be different if we had implemented $k$-means on the raw returns (without a kernel transform)?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SOLUTION.\n",
    "\n",
    "_Your solution here._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Least Squares and LAD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A. LAD Regression is a Linear Program\n",
    "\n",
    "Show that LAD regression can be written as a linear program in standard form. What does this mean computationally as compared to the equivalent problem in which $\\ell_1$ replaced with $\\ell_2$?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SOLUTION.\n",
    "\n",
    "_Your solution here._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### B. Adaptive Index Quantiles with Least Squares\n",
    "\n",
    "Executing the cell below will create a pandas dataframe `X` containing 2015-16 daily adjusted percent returns for the top 500 US companies by market cap (as of 31 December 2016) and another pandas dataframe `y` containing daily adjusted returns for the `SPY` ETF over the same period, which tracks the S&P 500 index closely."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X = pd.read_csv(\n",
    "    '../../data/top_750_returns.csv',\n",
    "    header=0,\n",
    "    index_col=0\n",
    ")\n",
    "\n",
    "y = pd.read_csv(\n",
    "    '../../data/spy.csv',\n",
    "    header=0,\n",
    "    index_col=0\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Split the data into training and validation sets, where the first 75% or so of dates constitute the former.  Using only `cvxpy` and `numpy`, perform [least-norm regression](https://see.stanford.edu/materials/lsoeldsee263/08-min-norm.pdf) on the training set, regressing `SPY`'s return on the lagged returns of the top 750 companies:\n",
    "\n",
    "$$\\min_{\\theta} \\|\\theta\\|_2^2 \\text{ subject to } y_t = X_{t-1}\\theta$$\n",
    "\n",
    "Plot a historgram of your model's residuals on the validation set. Does the distribution appear to be Gaussian? Check your hypothesis using a statistical test such as [Shapiro-Wilks](https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.shapiro.html).\n",
    "\n",
    "One way to forecast the 1-step-ahead 95th percentile of index returns for `SPY` using lagged returns of the top 750 companies is to follow steps similar to those outlined in [this stack exchange post](https://stats.stackexchange.com/questions/147242/how-to-calculate-the-prediction-interval-for-an-ols-multiple-regression). Comment on the potential effectiveness of this method in light of the out-of-sample residual distribution you previously analyzed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SOLUTION.\n",
    "\n",
    "_Your solution here._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### C. Adaptive Index Quantiles with LAD\n",
    "\n",
    "Perform quantile regression on the training set to estimate the 1-step-ahead 95th percentile of the `SPY`. Use the setting `solver=CBC` when calling the `.solve()` method in `cvxpy`. How does this model perform on the validation set? Plot the daily returns of `SPY` with the forcasted $95^{th}$ percentile superimposed.\n",
    "\n",
    "Speculate about how this model would perform in practice compared to the least squares model in the previous part."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SOLUTION.\n",
    "\n",
    "_Your solution here._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### D. Adaptive Index Quantiles with Ridged LAD\n",
    "\n",
    "Repeat the previous exercise, but with an $\\ell_2$ penalty on the vector of regression coefficients $\\theta$:\n",
    "\n",
    "$$\\theta_\\lambda = \\arg\\min_{\\theta} \\|y_t - X_t\\theta_\\lambda\\|_1 + \\lambda \\|\\theta\\|_2$$\n",
    "\n",
    "Tune the $\\ell_2$ penalty parameter $\\lambda$ using the validation set to obtain the optimal penalty $\\lambda^\\star$. How is the vector of regression coefficients $\\theta_{\\lambda^\\star}$ different from $\\theta_0$? How can you interpret this from an investing viewpoint? How does this model compare, on the validation set, to the model in part C? Explain."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SOLUTION.\n",
    "\n",
    "_Your solution here._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Support Vector Machine\n",
    "\n",
    "In this question, we consider an application of SVM in text classification for volatility prediction. Executing the following cell will load a $1470 \\times 971$ `numpy` array into a variable `features` where each row corresponds to a published article and each column corresponds to the frequency of a keyword that appears in the article (i.e. our archive contains $1470$ articles and our dictionary contains $971$ keywords). Each article is about a certain company. Binary labels will also be imported as a $1470 \\times 1$  `numpy` array `labels`. An article's label is $+1$ if the article caused an immediate and significant change (positive or negative) to the company's stock price. Otherwise, the label is $-1$. The data has been divided into a training set, which will be used to train you SVM, and a validation set which will be used to test the SVM's prediction accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "features = # load data here\n",
    "labels = # load data here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A. $\\ell_2$ SVM\n",
    "\n",
    "The ridged-SVM classification problem can be formulated as the following optimization problem:\n",
    "\n",
    "$$\\underset{w, b}{\\text{min }} \\frac{1}{2}\\left\\|w\\right\\|_2^2 + C\\sum_{i=1}^{1470}{\\left(1 - y_i\\left(w^\\top x_i + b\\right)\\right)_+}$$\n",
    "\n",
    "where $y_i$ denotes the $i^{th}$ label, $x_i$ denotes the $i^{th}$ vector of word frequencies in the articles, $w$ is the weights or vector of coefficients, $b$ is the offset or intercept, and $C$ is a model parameter is inversely related to the ridge regularization of the weights vector $w$. This is a quadratic optimization problem.\n",
    "\n",
    "Using `cvxpy`, implement this SVM (estimate the $w$ and $b$ parameters) on the training set and tune the parameter $C$ from $0$ to $100$ by checking classification accuracy on the validation set. Plot the training accuracy versus $C$ curve and validation accuracy versus $C$ curve. Briefly comment on the results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SOLUTION.\n",
    "\n",
    "_Your solution here._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### B. Feature Selection by Magnitude\n",
    "\n",
    "We want to find the keywords that are most important for classification. Consider the following approach: Fix $C=10$ and perform SVM to estimate $w$ on the training set. Sort elements of $w$ by their absolute value in descending order, choose a the top $k$, and then perform SVM on this subset of the features. What are some advantages or disadvantages you anticipate in approaching feature selection in this manner?\n",
    "\n",
    "Try $k\\in \\{10, 20, 30, \\dots, 190, 200\\}$ and evaluate classification accuracy on the validation set. Comment on the result."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SOLUTION.\n",
    "\n",
    "_Your solution here._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### C. $\\ell_1$ SVM\n",
    "\n",
    "Consider a different approach for feature selection: Subsitute the $\\ell_2$-norm penalty with an $\\ell_1$-norm penalty in the SVM objective:\n",
    "\n",
    "$$\\underset{w, b}{\\text{min }} \\left\\|w\\right\\|_1 + C\\sum_{i=1}^{1470}{\\left(1 - y_i\\left(w^\\top x_i + b\\right)\\right)_+}$$\n",
    "\n",
    "How might this approach compare to the previous feature-selection approach suggested in part B above?\n",
    "\n",
    "Sweep the parameter $C$ in the interval $[0, 100]$ and plot the number of non-zero elements in $w$ versus the prameter $C$. Note that due to `cvxpy`'s limited numerical precision, zero elements are not exactly $0$. So the criterion for non-zero element is $|w_i| > 10^{-6}$. Perform a similar plot of validation set accuracy versus the parameter $C$. Comment on the results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SOLUTION.\n",
    "\n",
    "_Your solution here._"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
